Abstract

Large-scale key-value storage systems sacrifice consis- 
tency in the interest of dependability (i.e., partition- 
tolerance and availability), as well as performance (i.e., 
latency). Such systems provide eventual consistency, 
which — to this point — has been difficult to quantify 
in real systems. Given the many implementations 
and deployments of eventually-consistent systems (e.g., 
NoSQL systems), attempts have been made to measure 
this consistency empirically, but they suffer from important
drawbacks. For example, state-of-the-art consistency
benchmarks exercise the system only in restricted ways 
and disrupt the workload, which limits their accuracy. 

In this paper, we take the position that a consistency 
benchmark should paint a comprehensive picture of the 
relationship between the storage system under considera- 
tion, the workload, the pattern of failures, and the consis- 
tency observed by clients. To illustrate our point, we first 
survey prior efforts to quantify eventual consistency. We 
then present a benchmarking technique that overcomes 
the shortcomings of existing techniques to measure the 
consistency observed by clients as they execute the work- 
load under consideration. This method is versatile and 
minimally disruptive to the system under test. As a proof 
of concept, we demonstrate this tool on Cassandra. 

1 Introduction 

Large-scale key-value storage systems are quickly be- 
coming an essential component of many IT infrastruc- 
tures. From fast-growing start-ups to large enterprises, 
these systems are becoming commonplace in production 
use because of their ability to scale easily and the avail- 
ability of many widely-supported software implementa- 
tions. However, in order to provide performance and de- 
pendability at scale, the common principle followed by 
these key-value systems is to relax data consistency [25].
As these systems find their way into a wider variety of
industries, it becomes increasingly important to understand
the implications of this relaxed consistency model:
to what extent relaxation improves system performance 
and to what extent it degrades data consistency. 

For example, Web-based applications rely on key- 
value systems to provide high-throughput and low- 
latency access to content. While these applications do 
not strictly require serializability for correct operation, 
they may require a stronger property than eventual consistency,
such as causal or "causal+" [16] consistency, in
order to improve the user experience. 

On the other hand, cloud-based health care applica- 
tions likely value predictable consistency over perfor- 
mance. Eventually consistent updates to a patient's 
record may introduce mistakes along the path of patient 
care. For example, stale information (e.g., due to weak 
consistency) about a patient's dosage or medical history 
may lead to incorrect, or — in an extreme case — harmful 
treatment plans. 

Today, cloud customers who care about consistency 
have limited means to understand or control data consis- 
tency when choosing among available storage systems, 
or their configurations. For example, decisions to tune 
"knobs" such as the replication factor or quorum size re- 
main ad-hoc, and may lead to excessive replication or 
operational costs. More importantly, no combination of 
these knob settings can ensure that the storage system 
is strongly consistent (e.g., always returns the freshest 
data). This shortcoming is a fundamental limitation of 
such always-available, partition-tolerant systems, as observed
by Brewer [8] and formalized by Lynch et al. ifTTl .
Moreover, many modern systems often choose to further 
sacrifice consistency for better performance 0. 

We argue that a methodology for comprehensive con- 
sistency measurement is necessary to evaluate today's 
eventually consistent systems. Such a measurement 
framework can identify the shortcomings of architectural
designs or implementation errors in existing systems.
Moreover, it can determine the actual consistency behavior
of a particular deployment, which may be helpful
to guide configuration and deployment decisions.

Prior techniques for measuring consistency follow a 
methodology that is oversimplified, and as a result suffer 
from important drawbacks. For example, the act of mea- 
surement disrupts the workload by injecting operations, 
causing a troublesome "observer effect". Moreover, the 
injected operations tend to stress the system, which may 
elicit worst-case behavior even for a light workload. Un- 
derstanding observed, as opposed to worst-case, consis- 
tency is important for systems designers considering per- 
formance trade-offs, particularly if observed consistency 
is vastly different from the worst-case. 

Our position is simple — a consistency benchmark 
should produce precise and accurate measurements of 
consistency with minimal disruption to the system un- 
der evaluation. These measurements should reflect the 
consistency actually observed by clients in the work- 
load under consideration, rather than the consistency of 
operations injected artificially into the workload. Fur- 
thermore, a benchmark must collect measurements in a 
system-agnostic way, enabling comparisons not only be- 
tween different implementations of the same consistency 
model (e.g., sloppy quorums [4]), but also between different
consistency models.

In this paper, we describe a principled approach to 
consistency measurement that captures more faithfully 
and accurately the actual consistency behavior of a key- 
value storage system for an arbitrary workload. Our spe- 
cific contributions are: 

1. A survey of known techniques for quantifying and
benchmarking consistency, and discussion of their
limitations (Section 2).

2. An outline of a more general and precise approach
to consistency measurement (Section 3).

3. A proof-of-concept benchmarking tool, which we
use to obtain consistency measurements for the
Cassandra [1] key-value store (Section 4).

2 Related work 

Consistency in this paper refers to the notion that dif- 
ferent clients accessing a storage system agree in some 
way on the state of data. In the literature, this is termed 
the client-centric view, as opposed to the data-centric 
view, which refers to details that are not directly observ- 
able by clients (e.g., messages in flight, state of repli- 
cas). The client-centric view is more natural in the con- 
text of benchmarking consistency, as it does not require 
system-specific and disruptive instrumentation to collect 
intimate details of the execution. Instead, it considers 
only the information that clients can capture locally as 
they apply get and put operations on keys, such as the
start and end time of each operation as well as its
arguments and response.
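
To make the client-centric view concrete, the following Python sketch shows one way a client could record exactly this information. It is an illustration only: the names OpRecord, LoggingClient and DictStore are hypothetical and do not come from any benchmark or storage system discussed here, and DictStore merely stands in for a real key-value store.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class OpRecord:
    kind: str     # "get" or "put"
    key: str
    value: Any    # argument of a put, or response of a get
    start: float  # client-local time just before the call is issued
    end: float    # client-local time just after the response arrives


class LoggingClient:
    """Wraps any object that offers get(key) and put(key, value)."""

    def __init__(self, store: Any, clock: Callable[[], float] = time.time):
        self.store = store
        self.clock = clock
        self.log: List[OpRecord] = []

    def get(self, key: str) -> Any:
        start = self.clock()
        value = self.store.get(key)
        end = self.clock()
        self.log.append(OpRecord("get", key, value, start, end))
        return value

    def put(self, key: str, value: Any) -> None:
        start = self.clock()
        self.store.put(key, value)
        end = self.clock()
        self.log.append(OpRecord("put", key, value, start, end))


class DictStore:
    """In-memory stand-in for a real key-value store (illustration only)."""

    def __init__(self) -> None:
        self.data = {}

    def get(self, key: str) -> Any:
        return self.data.get(key)

    def put(self, key: str, value: Any) -> None:
        self.data[key] = value
```

The wrapper never inspects replicas or messages in flight. Everything it logs is visible at the client, which is what keeps this style of measurement independent of the system under test.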

Client-centric definitions of consistency typically re- 
fer to agreement on when and in which order operations 
take effect (e.g., see [22]). As we discuss shortly, early 
attempts to benchmark consistency focus on the "when", 
and interpret this question as meaning roughly "How 
soon after a write operation returns do read operations 
return the written value?", or in other words, "How eventual
is eventual?" [7, 19, 26].

Formalizing and answering these questions precisely 
bring us a step closer to understanding the complex rela- 
tionship between the workload applied to a storage sys- 
tem, the failure pattern, the configuration parameters, 
and the observed client-centric consistency. In contrast, 
prior work covers a narrow sub-space of this multidimen- 
sional relationship that considers only failure-free execu- 
tions, and relies on an informal methodology that exer- 
cises the storage system only near the limits of its "con- 
sistency envelope". 

Definitions of version- and time-based staleness

Staleness is a fundamental concern in data management, 
and can be used to describe the quality of both the data 
and the system that stores it. In this benchmark, we focus 
on the quality of the storage system, and in particular the 
protocol synchronizing different replicas of data. To that 
end, we consider staleness as a relative measure: how 
long ago was the value read first updated (e.g., see
Figure 1). In other words, data becomes stale the first time
it is overwritten by newer data. 

Prior techniques for quantifying staleness in key-value 
storage systems either count versions (e.g., the value read 
is the second-latest value written) or measure time (e.g., 
the value read is one hour older than the latest value writ- 
ten) HU [HE] US |29l. These quantities are easy 
to state precisely under the simplifying assumption that 
read and write operations are instantaneous — a collection 
of unique points on a one-dimensional axis. In that case, 
there is a natural total order on the operations, and more- 
over the "latest value" at any point in time is well defined. 
In contrast, in real-world scalable storage systems, oper- 
ational latencies due to processing, networking and I/O 
are non-trivial, and so there can be multiple operations 
in flight at any given time, even on a single key. Thus, 
non-trivial latencies and parallelism complicate reason- 
ing about when a given operation takes effect, as well as 
the order in which operations take effect relative to each 
other. 

Figure 1: Example calculation of Δ.

A more precise treatment of staleness devised by the
theory community includes the (client-centric) concepts
of k-atomicity [4] and Δ-atomicity [11]. The k-atomicity
property is a natural formalization of version-based
staleness. An execution of operations in a key-value store is
k-atomic if the operations in that execution can be totally
ordered so that: (1) the total order extends the "happens
before" partial order (i.e., if operation A ended before
operation B began during the execution, then A precedes B
in the total order); and (2) each read returns the value
assigned by one of the k most recent writes preceding the
read in the total order. (In the case k = 1, k-atomicity
corresponds to Lamport's atomicity concept [15], which
we discuss below.) For any given execution, we can
quantify version-based staleness by solving the following
optimization problem: find the smallest k for which
the execution is k-atomic. We are not aware of an efficient
(i.e., poly-time) solution to this problem, although
[11] presents progress toward solving the corresponding
decision problem for k = 2.
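
The two conditions of the definition can be checked directly on very small executions. The Python sketch below does so by exhaustive search over all total orders. It is not an efficient algorithm and not the technique of any cited work. It assumes that every write stores a unique value, and the names Op and smallest_k are hypothetical.

```python
from itertools import permutations
from typing import NamedTuple, Optional, Sequence


class Op(NamedTuple):
    kind: str     # "write" or "read"
    value: int    # value written, or value returned by the read
    start: float
    end: float


def respects_happens_before(order: Sequence[Op]) -> bool:
    # Condition (1): if A ended before B began, A must precede B.
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.end < earlier.start:
                return False
    return True


def staleness_of_order(order: Sequence[Op]) -> Optional[int]:
    # Condition (2): smallest k such that each read returns one of the
    # k most recent preceding writes; None if a read sees an unwritten value.
    written = []
    worst = 1
    for op in order:
        if op.kind == "write":
            written.append(op.value)
        elif op.value not in written:
            return None
        else:
            # 1 = latest write, 2 = second latest, and so on.
            worst = max(worst, len(written) - written.index(op.value))
    return worst


def smallest_k(ops: Sequence[Op]) -> Optional[int]:
    best = None
    for order in permutations(ops):
        if not respects_happens_before(order):
            continue
        k = staleness_of_order(order)
        if k is not None and (best is None or k < best):
            best = k
    return best
```

The search visits every permutation, so its cost grows factorially with the number of operations. That is acceptable only for toy inputs, which is consistent with the remark above that no efficient solution is known for the general problem.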

The Δ-atomicity property attempts to capture time-based
staleness by stating that read operations must return
values that are at most Δ time units staler than the
latest value for a key. More formally, if we "stretch" the
start time of each read to a point Δ time units earlier,
then the resulting execution should be atomic in Lamport's
sense [15]. For any given execution, it is possible
to compute the smallest Δ ≥ 0 for which that execution
is Δ-atomic using an efficient algorithm [11].

Figure 1 illustrates Δ in action. The start and end times
are shown for three writes and two reads, all operating
on the same key. We assume that each operation takes
effect between its beginning and end. For example, 2 is
the latest value from the moment write(2) ends, and possibly
even earlier. Thus, read(1) returns a value that is
stale by at least the width of the "gap" between it and
write(2). Even though write(3) is the latest value, staleness
for read(1) is measured from the first unseen update
to the key: write(2). Similarly, the staleness for read(2)
is measured from the end of write(3).
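
The gap described for Figure 1 can be reproduced with a few lines of Python. The sketch below is a simplification, not the general Δ computation: it assumes that writes to the key do not overlap in time, that each write stores a unique value, and that client clocks are synchronized. The names Write, Read and staleness_gap are hypothetical, and the timings in the usage note are invented because the figure's actual values are not given in the text.

```python
from typing import NamedTuple, Sequence


class Write(NamedTuple):
    value: int
    start: float
    end: float


class Read(NamedTuple):
    value: int
    start: float
    end: float


def staleness_gap(read: Read, writes: Sequence[Write]) -> float:
    """Time between the end of the first unseen write and the start of the read."""
    ordered = sorted(writes, key=lambda w: w.start)
    values = [w.value for w in ordered]
    idx = values.index(read.value)
    if idx + 1 == len(ordered):
        return 0.0  # the read returned the latest written value
    successor = ordered[idx + 1]  # first update the read did not see
    # A read that starts before the successor finishes is not stale.
    return max(0.0, read.start - successor.end)
```

With three writes occupying the intervals [0, 10], [20, 30] and [40, 50] for the values 1, 2 and 3, a read of value 1 that starts at time 45 is stale by 15 time units, measured from the end of write(2) rather than write(3).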

For completeness, we also briefly discuss well-studied 
notions of weakly consistent shared objects from dis- 
tributed computing theory literature. Lamport proposed 
the notions of safe, regular and atomic registers (i.e., 
shared objects that support read and write operations).
These specifications describe the correct behavior of
read operations when they can execute concurrently with 
writes and with each other, but do not adequately cap- 
ture the possibility that non-concurrent operations may 
appear to take effect out of order — a commonplace phe- 
nomenon in modern quorum-replicated systems. Lam- 
port's atomicity property is similar in spirit to Herlihy 
and Wing's linearizability [12] and Papadimitriou's strict 
serializability [18] for read/write register objects.

Measuring and bounding staleness 

Several papers attempt to measure or bound staleness in 
order to characterize the spectrum of trade-offs surrounding
Brewer's celebrated CAP principle [8]. Wada et al. [26]
measure time-based staleness in cloud storage platforms
by writing timestamps to a key from one client
three times per second, reading the same key from an- 
other client fifty times per second, and computing the 
difference between the reader's local time and the times- 
tamp read. In experiments using Amazon's SimpleDB 
0, they observe staleness on the order of seconds. 

The methodology of Wada et al. is sufficient to ob- 
tain evidence relevant to their central research question — 
whether cloud storage systems in practice provide more 
consistency than they promise. However, their technique 
also has several disadvantages as a consequence of exer- 
cising the system in an artificial way. First, the measure- 
ment is disruptive because it introduces additional write 
operations to the workload. This is unsuitable in a pro- 
duction environment, unless the operations are applied to 
a special "dummy" key, in which case the outcome may 
not predict accurately the staleness observed by reads on 
the other keys. Secondly, the technique is pessimistic be- 
cause it considers a pattern of access where read opera- 
tions occur back-to-back with writes. This measurement 
captures the minimum time needed for replicas of a key- 
value pair to synchronize, but in a real world workload, 
gaps between operations may result in clients observing 
far less staleness. In particular, if the load is trivial then 
it is possible that all operations (except the ones injected 
artificially) will be atomic. A third drawback is the use of 
only a single writer. While this certainly simplifies cal- 
culations, the measurements obtained may fail to cover 
special execution paths of the storage system for dealing 
with concurrent writes, which hurts accuracy further. 

Bermbach et al. [7] and Patil et al. [19] measure staleness
using techniques similar to Wada et al. The latter
paper presents an extension of the Yahoo Cloud Serving
Benchmark (YCSB) [9], with support for basic consistency
benchmarking. Their technique relies on a middleware
service, namely ZooKeeper [13], to convey timing
information between readers and writers. This technique
is limited in precision due to the latency introduced by
operations on ZooKeeper, and hence it produces results
with one-sided error: reported consistency violations are
true assuming synchronized clocks, but lack of reported
violations does not imply atomic behavior.

Bailis et al. [6] consider the problem of predicting
the staleness from an abstract model of the storage sys- 
tem, including details such as the distribution of laten- 
cies for network links. This work considers both ver- 
sion and time-based staleness, and provides an upper 
bound on the probability that a client observes stale 
data. This prediction, similar to the measurements of 
Wada et al., may be overly pessimistic for light work- 
loads. Predicting and measuring staleness are comple- 
mentary techniques — prediction can be used for planning 
and measurement can be used in a variety of ways, such 
as performance tuning, monitoring, evaluating service- 
level agreements, and feedback control. 

Other work 

Shapiro et al. formalize eventual consistency for shared 
objects that avoid conflicts by design, for example by 
providing commutative operations [20, 21]. Conventional
key-value storage systems, like Cassandra, fall
outside this category because write operations are inherently
conflict-prone. Zhu et al. [30] give formal definitions
of eventual consistency for read/write storage systems,
as well as several client-centric properties:
read-your-writes, monotonic reads, writes follow reads, and
monotonic writes. This work does not provide a way to 
measure the difference between a particular consistency 
property and the actual consistency delivered by a storage
system. Less formal definitions of eventual consistency
appear in numerous papers (e.g., [23, 25]).

3 Toward a benchmarking framework 

We focus on creating a client-centric benchmarking tool 
that measures observed consistency and is minimally dis- 
ruptive to the system under evaluation. Since consistency 
and fault-tolerance are intimately related in eventually 
consistent systems, the tool should provide support for 
fault injection. This includes crashes (individual and cor- 
related) as well as network partitions, and necessitates 
"white-box" access to the infrastructure. Finally, the tool 
should simplify analysis of the results by presenting use- 
ful visualizations to the user. 

As a stepping stone towards building a comprehensive 
benchmarking framework, we now describe a method- 
ology for minimally disruptive measurement of consis- 
tency in arbitrary workloads. We then suggest how such 
measurements might be visualized. Since our methodol- 
ogy is client-centric, it can be married with any workload 
generator. The measurement entails collecting timing information
at clients for an arbitrary interleaving of operations,
and calculating consistency metrics only from this
information using theoretically sound techniques. As a
running example, we consider the calculation of the Δ
quantity described in Section 2, and then discuss integration
with YCSB [9].

Δ-atomicity is defined abstractly for arbitrary executions,
including ones containing concurrent writes to the
same key. To quantify staleness, we propose to calculate
Δ for a given execution using the procedure described
in [11]. First, we group operations into clusters: sets
of operations that access the same key and read or write
the same value [10]. For example, in Figure 1 there are three
clusters, red, blue and green, corresponding to the values
1, 2 and 3. Next, we choose a key k, and for each
pair of clusters for that key we determine the staleness
due to the interaction of operations in these clusters
by evaluating a scoring function χ [11]. We omit the
formal details and point out only that in Figure 1, χ is
the width of the staleness "gaps" experienced by read(1)
and read(2). Finally, we compute the Δ value for key
k by taking the maximum of χ over all pairs of clusters
for k. We repeat for each key and, taking the maximum,
obtain a global Δ indicating the staleness for the entire
execution. Note that since the calculation combines time
values from multiple hosts, accuracy is contingent upon
synchronized clocks.
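
The overall shape of this procedure (cluster, score each pair, take maxima) can be sketched in Python as follows. Because the formal definition of χ is omitted here, the scoring function is passed in as a parameter. The toy_score function below is only a stand-in that reproduces the "gap" intuition for non-overlapping writes with one put per cluster, and all names in the sketch are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, List, NamedTuple, Sequence


class Op(NamedTuple):
    kind: str     # "get" or "put"
    key: str
    value: int
    start: float
    end: float


Cluster = List[Op]
Score = Callable[[Cluster, Cluster], float]


def build_clusters(ops: Sequence[Op]) -> Dict[str, Dict[int, Cluster]]:
    # A cluster holds all operations on one key that read or write one value.
    clusters: Dict[str, Dict[int, Cluster]] = defaultdict(lambda: defaultdict(list))
    for op in ops:
        clusters[op.key][op.value].append(op)
    return clusters


def delta_per_key(ops: Sequence[Op], score: Score) -> Dict[str, float]:
    # Per-key value: maximum score over all pairs of clusters for that key.
    result = {}
    for key, by_value in build_clusters(ops).items():
        pairs = combinations(by_value.values(), 2)
        result[key] = max((score(a, b) for a, b in pairs), default=0.0)
    return result


def global_delta(ops: Sequence[Op], score: Score) -> float:
    # Global value: maximum over all keys.
    return max(delta_per_key(ops, score).values(), default=0.0)


def toy_score(a: Cluster, b: Cluster) -> float:
    """Stand-in for the scoring function; assumes exactly one put per cluster."""
    def write_of(cluster: Cluster) -> Op:
        return next(op for op in cluster if op.kind == "put")

    old, new = sorted((a, b), key=lambda c: write_of(c).start)
    late_reads = [op.start for op in old if op.kind == "get"]
    if not late_reads:
        return 0.0
    # How long after the newer write ended did a read of the older value start?
    return max(0.0, max(late_reads) - write_of(new).end)
```

Keeping the score pluggable means the aggregation logic stays the same if the toy function is replaced by a faithful implementation of the scoring function from [11].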

The quantities χ and Δ can be displayed visually in
various ways. For example, using Δ values for different keys,
we can plot a histogram that shows what proportion of
the key space was read in a consistent manner. Or, using
χ values for one key, we can plot a histogram that shows
what proportion of clusters contained reads of stale values
(which, in turn, estimates what proportion of reads
returned stale values). We can also use a time series plot
of χ to visualize the instantaneous consistency in an
execution, which indicates the staleness of values read at
different points in time. This allows us to observe how
staleness varies over time (e.g., in response to load spikes
or failures), information that is masked by Δ alone since
it quantifies consistency for the duration of an entire
execution. Note that χ and Δ, as well as the corresponding
visualizations, can be obtained for a subset of the key
space (e.g., chosen through random sampling).
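
The data behind such plots can be prepared without any plotting library. The Python sketch below shows one possible way to do so. The function names, the bin width and the window length are illustrative assumptions, and reducing each time window to its maximum χ is just one of several reasonable choices.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def proportion_consistent(delta_by_key: Dict[str, float]) -> float:
    """Fraction of keys whose per-key Δ is zero."""
    if not delta_by_key:
        return 1.0
    clean = sum(1 for d in delta_by_key.values() if d == 0)
    return clean / len(delta_by_key)


def histogram(values: Iterable[float], bin_width: float) -> Dict[float, int]:
    """Counts of values per bin, keyed by the lower edge of each bin."""
    counts = Counter(int(v // bin_width) for v in values)
    return {b * bin_width: counts[b] for b in sorted(counts)}


def time_series(samples: Iterable[Tuple[float, float]],
                window: float) -> List[Tuple[float, float]]:
    """Largest χ observed in each time window, from (read_time, chi) samples."""
    worst: Dict[int, float] = {}
    for t, chi in samples:
        w = int(t // window)
        worst[w] = max(worst.get(w, 0.0), chi)
    return [(w * window, worst[w]) for w in sorted(worst)]
```

Windows that contain no stale read are simply absent from the time series, so a plotting front end would need to decide whether to draw them as zero or leave a gap.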

4 Experimental evaluation 

To demonstrate our benchmarking methodology, we in- 
tegrated our consistency measurement technique into 
YCSB [9], and used the modified YCSB to measure consistency
in Cassandra [1], a widely adopted key-value
storage system. Our experiments use YCSB v. 0.1.4 and
Cassandra v. 1.1.0.

Figure 2: Histogram of score function (χ) values.

Figure 3: Time series plot of score function (χ) values.

The experimental hardware platform is a cluster of ten
commodity dual-socket 6-core Xeon servers equipped
with 1GigE network interface cards and 96GB DRAM.
Each server ran a 32-thread YCSB client on one socket,
and a Cassandra node on the other socket, configured 
with default options except as follows: keys were hashed 
uniformly across all nodes and 3-way replicated using 
the "simple" replica placement strategy [2]. By default,
the Cassandra connector in YCSB used consistency level 
"ONE" for both reads and writes. This consistency level 
requires that a write be applied to the commit log and 
memory table of at least one replica node before return- 
ing to the client, and allows a read to return the value 
obtained from the first replica that responds. 

We instrumented the YCSB source code to log tim- 
ing information for each operation using a millisecond- 
precision clock. We pre-loaded Cassandra with 1000 
keys and applied a read-heavy (80% get, 20% put) work- 
load for 60 seconds. The keys were drawn from YCSB's 
"hot spot" distribution, with 80% of the operations going 
to a subset of hot keys comprising 20% of the key space. 
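
For readers who want to reproduce the shape of this workload outside YCSB, the access pattern can be approximated as in the Python sketch below. This is not YCSB's own generator. It assumes that keys are chosen uniformly within the hot set and within the cold set, and the function and parameter names are hypothetical.

```python
import random
from typing import Tuple


def next_operation(rng: random.Random,
                   num_keys: int = 1000,
                   hot_fraction: float = 0.2,
                   hot_op_fraction: float = 0.8,
                   read_fraction: float = 0.8) -> Tuple[str, int]:
    """Draw one (operation, key index) pair from a hot-spot workload."""
    hot_keys = int(num_keys * hot_fraction)
    if rng.random() < hot_op_fraction:
        key = rng.randrange(hot_keys)            # one of the hot keys
    else:
        key = rng.randrange(hot_keys, num_keys)  # one of the cold keys
    kind = "get" if rng.random() < read_fraction else "put"
    return kind, key
```

With the default arguments, roughly four out of five operations are gets and roughly four out of five operations target the first 200 of the 1000 keys, matching the proportions stated above.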

We computed χ and Δ from collected timing information,
as described in Section 3. Figure 2 is a histogram
of positive χ values for all keys. Each point represents
the relative staleness observed by some read operation
on some key. The value of χ ranges from 1 ms to 233 ms,
and the margin of error due to clock skew is around 1 ms.
In comparison, Wada et al. report much higher maximum
staleness levels in their experiments using Amazon's
SimpleDB (see Figures 2 and 3 in [26]).



Figure 3 shows a time series plot of the χ values for
all keys. This visualization allows us to observe how
staleness varies over time, in contrast to the distribution
of staleness values captured in Figure 2. In Figure 3, the
x-axis depicts the approximate time when a read returns
a stale value, and the y-axis depicts the corresponding χ
value. Most of the data points are concentrated near the
x-axis, as we expect based on the histogram, and furthermore
there are a few visible "inconsistency spikes".

Finally, we measured the overhead of instrumentation 
that is required to compute the staleness metric and ob- 
served a performance loss of less than five percent with 
instrumentation enabled. 

5 Conclusions and future work 

In this paper, we present a client-centric benchmarking 
methodology for understanding eventual consistency in 
distributed key-value storage systems. Our methodology
measures observed, rather than worst-case, consistency. 
It extends the popular YCSB benchmark to measure the 
staleness of data returned by reads using the concept of 
Δ-atomicity [11]. Because our technique does not inject
operations into the workload, it measures consistency in 
a more faithful manner than prior benchmarks. By mea- 
suring consistency in a system-agnostic manner, we pro- 
vide a quantitative methodology for examining the per- 
formance vs. consistency trade-offs across various key- 
value system architectures. 

Using a preliminary implementation of our methodol- 
ogy, we demonstrate that the staleness of data in Cassan- 
dra exhibits a long and thin tail. That is, the worst-case 
staleness is much higher than the typical staleness of data 
returned by read operations. This observation has impli- 
cations for a system administrator when deciding how to 
configure or deploy a system like Cassandra — depending 
on the desired performance and deployment size, the 
choice of replication factor and quorum sizes can be 
guided by our benchmark results rather than guesswork. 

We are actively extending our work to consider runs 
with failures. Events such as network partitions, software 
crashes, or device failures may trigger special execution 
paths in the system and result in different consistency be- 
haviors. Our goal in future work is to stage experiments 
involving such failures through additional modifications 
to the Δ-enabled YCSB suite.